Blind Construction of Optimal Nonlinear Recursive Predictors for 

Discrete Sequences 
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Abstract 

We present a new method for nonlinear predic- 
tion of discrete random sequences under minimal 
structural assumptions. We give a mathematical 
construction for optimal predictors of such pro- 
cesses, in the form of hidden Markov models. We 
then describe an algorithm, CSSR (Causal-State 
Splitting Reconstruction), which approximates 
the ideal predictor from data. We discuss the re- 
liability of CSSR, its data requirements, and its 
performance in simulations. Finally, we compare 
our approach to existing methods using variable- 
length Markov models and cross-validated hid- 
den Markov models, and show theoretically and 
experimentally that our method delivers results 
superior to the former and at least comparable 
to the latter. 



1 Introduction 

The prediction of discrete sequential data is an impor- 
tant problem in many fields, including bioinformatics, 
neuroscience (spike trains), and nonlinear dynamics 
(symbolic dynamics). Existing prediction methods, 
with the exception of variable-length Markov model 
(VLMM) methods, make strong assumptions about 
the nature of the data-generating process. In this pa- 
per, we present an algorithm for the blind construc- 
tion of asymptotically optimal nonlinear predictors of 
discrete sequences. These predictors take the form of 
minimal sufficient statistics, naturally arranged into 
a hidden Markov model (HMM). We thus secure the 
many desirable features of HMMs, and hidden-state 
models more generally, without having to make a pri- 
ori assumptions about the architecture of the system. 
Furthermore, our method is more widely applicable 
than those based on VLMMs. We also compare our 
approach to the use of cross-validation to select an 
HMM architecture, and find our results are at least 
comparable in terms of accuracy and parsimony, and 
superior in terms of speed. For reasons of space, we 
omit proofs here. These can be found in ^ ch. 5] and 
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12] . The source code and documentation for an imple- 
mentation of CSSR are at http : //bactra. org/CSSR/ 

2 Optimal Nonlinear Predictors 

Consider a sequence of random variables X t drawn 
from a discrete alphabet A. A predictive statistic is 
a function r\ on the past measurements X^_ x . We 
want to predict the process, so we want the statis- 
tic to summarize the information X^^ contains about 
That is, we wish to maximize the mutual in- 
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formation between the statistic ^(X^) and the fu- 



ture X™ lt i.e. 



in the standard notation, maximize 
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JfrpfioJ; X^}, which can be at most I[X l _, 
A sufficient statistic is one which reaches this level. 
This implies r\ is sufficient if and only if r)(x~) = rj(y~) 
implies P(A- 1 |Xi oc = x~) - V{X^\X^ = y~) 
|31 • Decision theory shows that optimal prediction re- 
quires only knowledge of a sufficient statistic 0]. It 
is desirable to compress a sufficient statistic, so as to 
minimize the information needed for optimal predic- 
tion. One sufficient statistic r\\ is smaller than an- 
other 772 if r\\ can be calculated from 772. A minimal 
sufficient statistic is one which can be calculated from 
any other sufficient statistic. Minimal sufficient statis- 
tics thus are the most compact summary of the data 
which retains all the predictively-relevant information. 
We now construct one, following [5] and UJ. 

Say that two histories x~ and y~ , arc equivalent when 
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equivalence class of a: is [x ]. Define the function 
which maps histories to their equivalence classes: 



e(x ) = [x } 
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The possible values of e, i.e., the equivalence classes, 
are known as the "causal states" of the process; each 
corresponds to a distinct distribution for the future. 
(We comment on the name "causal state" below.) The 



state at time t is S* 



e(X* 



Clearly, e(x ) is 



a sufficient statistic. It is also minimal, since if r\ is 
sufficient, then r\(x~) = r/(y~) implies e(x~) — e{y~)- 
One can further show [5] that e is the unique minimal 
sufficient statistic, meaning that any other must be 
isomorphic to it. 

The causal states have some important properties [5]. 
(1) {S t } is a Markov process. (2) The causal states 
are recursively calculable; there is a function T such 
that St+i = T(St,X t+ i). (3) One can represent the 
observed process X as a random function of the causal 
state process, i.e., there is naturally a hidden-Markov- 
model representation. (The familiar correspondence 
between HMMs and state machines lets us re-phrase 
the second property as: the causal states form a deter- 
ministic machine.) We will refer to causal state models 
or causal state machines. 

The construction of the causal states is essentially the 
same as that of the "measure-theoretic prediction pro- 
cess" introduced by Frank Knight [7], though that is 
framed directly in terms of the conditional distribu- 
tions. Both can be regarded as applications of Wes- 
ley Salmon's concept of a "statistical relevance basis" 
[E] to time series. Causal states are also closely re- 
lated to the "predictive state representations" (PSRs) 
of controlled dynamical systems due to Littman, Sut- 
ton and Singh 9 , though PSRs are not generally min- 
imal. (There is currently no discovery procedure for 
PSRs, though there are ways to learn the parameters 
of a given PSR JOj.) It is not clear that the "causal 
states" really are causal in the strong sense of e.g. [TT] . 
but this needs investigation. Meanwhile, they need a 
name, and "causal states" is less awkward than the 
others. 

Our algorithm for inferring the causal states from data 
builds on the following observation pp. 842-843]. 
Say that r\ is next-step sufficient if I[Xt+i; ^(Xt.^)] = 
I[X t+ i; Xtoo]- A next-step sufficient statistic contains 
all the information needed for optimal one-step-ahead 
prediction, but not necessarily for longer predictions. 
If 77 is next-step sufficient, and it is recursively calcu- 
lable, then 7] is sufficient for the whole of the future. 
Since e satisfies these hypotheses, the minimal suffi- 
cient statistic can be found by searching among those 
which are next-step sufficient and recursive. 

3 Causal-State Splitting 
Reconstruction 

We now describe an algorithm, Causal-State Splitting 
Reconstruction (CSSR), that estimates an HMM with 
the properties described in the last section from se- 
quence data. CSSR starts by "assuming" the process is 
an independent, identically-distributed sequence, with 
one causal state, and adds states when statistical tests 



show that the current set of states is not sufficient. 

Suppose we are given a sequence x of length N from a 
finite alphabet A of size k. We wish to derive from this 
an estimate e of the the minimal sufficient statistic e. 
We will do this by finding a set of states S, each mem- 
ber of which will be a set of strings, or finite-length 
histories. The function e will then map a history x~ 
to whichever state contains a suffix of x~ (taking "suf- 
fix" in the usual string-manipulation sense). Although 
each state can contain multiple suffixes, one can check 
[2] that the mapping e will never be ambiguous. (This 
contrasts with the variable-length Markov models de- 
scribed in Sec. 15.11 below, where each state contains 
only a single suffix.) 

The null hypothesis is that the process is Markovian 
on the basis of the states in X; that is, 

P(Jf t |X*li - oati+i) = P(X t \S = eOcJli+i)) (1) 

for all a £ A. That is, adding an additional piece of 
history does not change the conditional distribution 
for the next observation. We can check this with a 
standard statistical test, such as \ 2 or Kolmogorov- 
Smirnov (which we used in our experiments below). 
If we reject this hypothesis, we fall back on a re- 
stricted alternative hypothesis, which is that we have 
the right set of conditional distributions, but have 
matched them with the wrong histories. That is, 
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i +1 ) = P(X t \S = s*) 



(2) 



for some s* G S, but s* ^ £( x t-L+i)- ^ tms hypothe- 
sis passes a statistical test, again with size a, then s* 
is the state to which we assign the history 1 . Only if 
the restricted alternative is itself rejected do we create 
a new state, with the suffix ax t Z L+1 - 

The algorithm itself has three phases; pseudo-code is 
given in Figure^ Phase I initializes S to a single state, 
which contains only the null suffix 0. (That is, is a 
suffix of any string.) The length of the longest suffix 
in £ is L; this starts at 0. Phase II iteratively tests 
the successive versions of the null hypothesis, Eq. ^ 
and L increases by one each iteration, until we reach 
some maximum length £ max . At the end of II, e is 
(approximately) next-step sufficient. Phase III makes 
e recursively calculable, by splitting the states until 
they have deterministic transitions. The last phase is 
not as straightforward as it may seem. 



1 If more than one such state s* exists, we chose the 
one for which P(X t \S — s*) differs least, in total variation 

distance, from ~P(Xt\XlZ L = olx\'Z l+1 ), which is plausible 
and convenient. However, which state we chose is irrelevant 
in the limit N — *• 00, so long as the difference between the 
distributions is not statistically significant. 



There are standard algorithms ^2] to take a non- 
deterministic finite automaton (NDFA) and produce a 
deterministic finite automaton (DFA), which is equiv- 
alent in the sense of generating the same language. 
However, these algorithms do not ensure that each 
state of the DFA, considered as an equivalence class 
of strings, is a subset of a state of the NDFA. In 
the present context, applying one of these algorithms 
would result in states which mixed histories with sig- 
nificantly different conditional distributions for the 
next symbol — we would get a statistic which was re- 
cursive but not next-step sufficient. To preserve proba- 
bilistic information while making the transitions deter- 
ministic, we proceed as follows |2J. We want there to be 
a transition function T(s, b) such that e(x~b) = T(s, b) 
for any x~ £ s. Thus, for each state-symbol pair s, b, 
we check whether i(x~b) is the same for all x~ G s. 
If we find a state-symbol pair where this does not 
hold, we split that state into states where it does 
hold. We then start checking the state-symbol pairs 
all over again, since some other transitions may have 
been altered. This procedure always terminates, leav- 
ing us with a set of states with deterministic transi- 
tions. To do this smoothly, we must first remove any 
transient states which the second phase may have cre- 
ated. These transients are never true causal states 
|13| . but are sometimes useful in filtering applications, 
in which case they can be straightforwardly restored 
from the true, recurrent states |13j . 

P(Xt|Xf~£ = x) may be estimated in several ways; we 
have used simple maximum likelihood. P(A t |S l = s), 
in turn, must be estimated, and we used the weighted 
average of the estimated distributions of the histories 
in s. When L = and the only state contains just the 
null string, P(X t \S = s) = P(X t ), the unconditional 
probability distribution. 

3.1 Time Complexity 

Phase I computes the relative frequency of all words 
in the data stream, up to length L max + 1. There are 
several ways this can be done using just a single pass 
through the data. In our implementation, as we scan 
the data, we construct a parse tree which counts the 
occurrences of all strings whose length does not exceed 
L max + 1. Thereafter we need only refer to the parse 
tree, not the data. This procedure is therefore O(N), 
and this is the only sub-procedure whose time depends 
on N. 

Phase II checks, for each suffix ax, whether it belongs 
to the same state as its parent x. Using a hash ta- 
ble, we can do this, along with assigning ax to the 
appropriate state, creating the latter if need be, in 
constant time. Since there are at most u(k, £ max ) = 



Algorithm CSSR(.A,x, L max , a) 

I. Initialization: L <- 0, £ <- {{0}} 

II. Sufficiency: 
while L < L max 

for each seS 

estimate P (X t \S = s) 
for each x 6 s 

for each a E A 

estimate p *— P(A t |A*~^ = ax) 
Test(£, p, ax, s, a) 
L «- L + 1 

III. Recursion: 

Remove transient states from E 
recursive <— FALSE 
until recursive 

recursive <— True 
for each s <E £ 
for each b e A 

xq «— first x <E s 
T(s,b) <— e(x b) 
for each x G s, x ^ xq 
if e{xb) ^ T(s, b) 
then create new state s'eS 
T(s', b) <- e(xb) 
for each y € s such that 

Kv h ) = K xb ) 
Move(t/, s,s') 

recursive <— FALSE 

Test(E,p, ax, s,a) 

if null hypothesis (Eq. ^| passes a test of size a 
then s <— ax U s 

else if restricted alternative hypothesis (Eq. |2J) 
passes a test of size a for s* 6 E, s* ^ s 
then Move(ox, s, s*) 
else create new state s'eS 
MoVE(ax,s,s') 

Move(x, si,s 2 ) 
Si <— si \ x 

re-estimate P(X t \S = Si) 
s 2 <— s 2 U x 
re-estimate P(X t \S = s 2 ) 



Figure 1: Pseudo-code for the algorithm CSSR. Argu- 
ments: A: discrete alphabet for the stochastic process; x: 
sequence of length A^ drawn from A; £ m ax, maximum his- 
tory length considered when estimating causal states; a, 
size of the hypothesis tests, i.e., probability of falsely re- 
jecting the hypothesis being tested. Newly created states 
are always empty initially. 



(fc Lmax+1 — l)/(fc — 1) suffixes, the time for phase II is 
0(u(fc,L max )) = 0(fc L — ). 

Phase III itself has three parts: getting the transi- 
tion structure, removing transient states, and refin- 
ing the states until they have recursive transitions. 
The time to find the transition structure is at most 
ku(k, Lmax)- Removing transients can be done by find- 
ing the strongly-connected components of the state- 
transition graph, and then finding the recurrent part 
of the connected-components graph. Both these op- 
erations take a time proportional to the number of 
nodes plus the number of edges in the state-transition 
graph. The number of nodes in the latter is at most 
M(fc,-L ma x), since there must be at least one suffix 
per node, and there are at most k edges per node. 
Hence transient-removal is 0(u(k,L max ){k + 1)) = 
( fc L m ax+i + fc L m ax) = 0{k L -™* +1 ). As for refining 
the states, the time needed to make one refining pass 
is A)M(fc,L max ), and the maximum number of passes 
needed is M(fc,L max ), since, in the worst case, we will 
have to make every suffix its own state, and do so 
one suffix at a time. So the maximum time for refine- 
ment is 0(ku 2 (k, L max )) = 0(fc 2Lmax+1 ), and the max- 
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imum time for all of phase III is 0(fc Lmax+1 -\-k L 
fc 2Lmax+1 ) = 0(fc 2Lmax+1 ). Note that if removing tran- 
sients consumes the maximal amount of time, then re- 
finement cannot and vice versa. 

Adding up, and dropping lower-order terms, the to- 
tal time complexity for CSSR is 0(fc 2L » ax+1 ) + 0(N). 
Observe that this is linear in the data size N. The high 
exponent in k is reached only in extreme cases, when 
every string spawns its own state, almost all of which 
are transient, etc. In practice, we have found CSSR to 
be much faster than this worst-case result suggests 2 . 



4 Convergence and Performance 

We have established the convergence of CSSR on 
the correct set of states, subject to suitable condi- 
tions. The proofs, which use large deviations theory 
on Markov chains, are too long to give here, so we 
simply state the assumptions and the conclusions 2 . 

We make the following hypotheses. 

1. The process is conditionally stationary. That 
is, for all values of r, P(X^_ 1 \X t _ <x> = x~) = 



2 Average-case time complexity will depend on the sta- 
tistical properties of the data source. For instance, the 
number of strings of length L max is here bounded by 
u(fe,L max ) ~ fc Lmax . But if L max is reasonably large, and 
the source satisfies the asymptotic equipartition property 
|14l sec. 15.7], only ~ 2 hLmax strings are produced with 
positive probability, where h < log k is the source entropy 
rate. 



P(X™ T+1 \X t+ ^ = x-). 

2. The process has only finitely many causal states. 

3. Every state contains at least one suffix of finite 
length. That is, there is some A such that every 
state contains a suffix of length A or less. This 
does not mean that A symbols of history always 
fix the state, just that it is possible to synchronize 
[15] to every state after seeing no more than A 
symbols. 

Under these assumptions, the reconstructed set of 
causal states "converges in probability" on the true 
causal states, in the sense that 

P(3aT : e(x~) ^e(aT)) -► 

as N — * oo 2, p. 16]. We establish this by show- 
ing that the probability of assigning a history to the 
wrong equivalence class goes to zero as N — * oo. More 
exactly, for a pair of histories x~ , y~ , define the events 

E(x-,y-) = (i(x-) = e(y-))A(e(x-)^e(y-)), 
F(x-,y-) = (e(x-) ? e(y~)) A (e(ar) - e(y~)) . 

Then H pp. 15-16] 

VaT.iT, P(E(x-,y-)UF(x-,y-))^0. 

To establish this fact in its turn, we use large devia- 
tion theory for Markov chains to show that the empir- 
ical conditional distribution for each history converges 
on its true value exponentially quickly pp. 12-13], 
and consequently the probability that any of our es- 
timated conditional distributions differs significantly 
from its true value goes to zero exponentially in N 
[21 pp. 13-15]. The test size a does not affect this 
convergence, becoming irrelevant in the limit of large 
N. (With finite N, of course, a influences our risk of 
making states simply on the basis of sampling fluctu- 
ations.) Under some further assumptions, P(e ^ e) 
actually goes to zero exponentially in N, and then, by 
the Borel-Cantelli Lemma, one has the wrong struc- 
ture only finitely often before converging on the true 
causal states. 

If the states are correct, then another large-deviation 
argument sec. 4.3] gives us a handle on the expected 
prediction error. Since the forecasts made by our pre- 
dictors are distributional, error should be measured as 
a divergence between the predicted distribution and 
the true one. Using the total variation metric 3 as our 
divergence measure, error goes down as A -1 / 2 . This 
is illustrated in Figure 

3 The total variation or L\ distance between two mea- 
sures P and Q over a discrete space A is d(P, Q) = 
^2a£A l-f( a ) — Q( a )\- Scheffe's identity |15| asserts that 
d{P,Q) = 2sup AC _4 |-P(A) — Q(A)|, and consequently < 
d{P,Q) <2. 



Figure 2: The Even Process. Transition labels show the 
symbol emitted, and the probability of making the transi- 
tion. 



Of the three assumptions in our convergence proof, the 
only one which affects the parameters of the algorithm 
is the third, that there is an integer A such that each 
of the true causal states contains at least one suffix 
of length A or less. A is thus a characteristic of the 
underlying process, not CSSR. If L max < A, not only 
does the proof of convergence fail, there is no way to 
obtain the true states. For periodic processes A is 
equal to the period; there are no general results for 
other kinds of processes. 

Rather than guessing A, one might use the largest 
feasible L m ax, but this is limited by the quantity of 
data |17|. Let L(N) be the the maximum L we can 
use when we have N data-points. If the observed 
process has the weak Bernoulli property (which ran- 
dom functions of irreducible Markov chains do), and 
an entropy rate of h, then a sufficient condition for 
the convergence of sequence probability estimates is 
that L(N) < log N/(h + e), for some positive e. If 
L(N) > log N/ h, probability estimates over length L 
words do not converge. We must know h to use this 
result, but logfc > h, so using logfc in those formulas 
gives conservative estimates. For a given process and 
data-set, it is of course possible that L(N) < A, in 
which case we simply haven't enough data to recon- 
struct the true states. 

We have tested CSSR on a variety of real and sim- 
ulated data sources, and here report two simulated 
examples, both binary-valued: the "even process," il- 
lustrated in Figure [3 and a seven-state process used 
in experimental studies of human sequence prediction 
[Tr)| , illustrated in figure (The results of applying 
CSSR to neuronal spike trains will be reported else- 
where.) 

For the even process, the system can start in either 
state 1 or state 2. When in state 1 it is equally likely 
to emit an A, staying in 1, or emit a B, moving to 
2. In 2 it always emits a B and moves to 1. This is 
an HMM, but it is not equivalent to any finite-order 
Markov chain (see below). Figures 0] and illustrate 
the ability of CSSR to get the correct number of causal 
states, the correct transition structure, and the cor- 




Figure 3: A seven-state process, used in |16j to study 
human sequence prediction. Here each state is defined by 
a single suffix, as indicated. All transition probabilities are 
multiples of 1/16. 



rect distributions. Figure also shows the asymptotic 
scaling of the error with N. Curves average over 30 
independent trials at each N; a is fixed to 10~ 3 . Re- 
sults for the seven-state process of Figure 01 are given 
in Section l5?2l 

5 Comparison with Previous Methods 

5.1 Variable-Length Markov Models 

The "context" algorithm of Rissanen JHj and its de- 
scendants [El EDI ED E21 construct "variable-length 
Markov models" (VLMMs) from sequence data. They 
find a set of contexts such that, given the context, the 
past of the sequence and its next symbol are condi- 
tionally independent. Contexts are taken to be suffixes 
of the history, and the algorithms work by examining 
increasingly long histories, creating new contexts by 
splitting existing ones into longer suffixes when thresh- 
olds of error are exceeded :23 . (This means that con- 
texts can be arranged in a tree, so these are also called 
"context tree" or "probabilistic suffix tree" [2S| algo- 
rithms.) 
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Prediction Error versus History Length 

















N = 10 A 6 — i — . 
N = 10 A 5 — x— 
N = 10M •-*£-■ 
N = 10 A 3 e 
N = 10 A 2 — »- 




















" 












" 












. 


. 




. 


. 





Scaled Error versus History Length 



Figure 4: Number of states inferred versus I/ max and N 
for the even process. The true number of causal states is 
2. 



Causal state reconstruction has an important advan- 
tage over VLMM methods. Each state in a VLMM is 
represented by a single suffix, and consists of all and 
only the histories ending in that suffix. For many pro- 
cesses, the causal states contain multiple suffixes. In 
these cases, multiple "contexts" are needed to repre- 
sent a single causal state, so VLMMs are generally 
more complicated than the HMMs we build. The 
causal state model is the same as the minimal VLMM 
if and only if every causal state contains a single suffix. 
This is the case for the process in Fig. where CSSR 
and VLMM methods will give the same results. 

Recall the even process of the last section. Any history 
ending in A, or in an A followed by an even number 
of B's, belongs to state I. Any history terminated by 
an A followed by an odd number of B's, belongs to 2. 
Clearly I and 2 both contain infinitely many suffixes, 
and so correspond to an infinite number of contexts. 
VLMMs are simply incapable of capturing this struc- 
ture. If we let L max grow, a VLMM algorithm will in- 
crease the number of contexts it finds without bound, 
but cannot achieve the same combination of predictive 
power and model simplicity as causal state reconstruc- 
tion (as illustrated by Figures^] and[5J|. Note, too, that 
the causal states for the even process have finite rep- 
resentations, even though they contain infinitely many 
suffixes. 

The even process is one of the strictly sofic processes 
|24U25| . which can be described by finite state models, 
but are not Markov chains of any finite order 4 . Just as 



4 More exactly, each history x~ has a follower set of 
futures x + which can succeed it. A process is sofic if it has 
only a finite number of distinct follower sets, and strictly 
sofic if it is sofic and has an infinite number of irreducible 
forbidden words. 
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Figure 5: Prediction error as a function of L ma x and V 
for the even process. Error is the total-variation distance 
between the actual distribution over words of length 10, 
and that predicted by the inferred states. Top panel: error 
(log scale) as a function of L max . Bottom: error times y~N 
(linear scale). Here 3 < L max < 10, since if L max < 3 CSSR 
cannot find the correct states. With this a, CSSR never 
gets the states right for V = 10 2 , and only sporadically for 
V = 10 3 , so those lines are not on the scaling curve. 



VLMMs cannot handle the even process, they cannot 
handle any strictly sofic process, even though those are 
just regular languages. Causal states cannot provide 
a finite representation of every stochastic regular lan- 
guage ^3], but the class they capture strictly includes 
those captured by VLMMs. 

5.2 Cross- Validation 

A standard heuristic for finding the right HMM ar- 
chitecture is cross-validation [2SI - O ne picks multiple 
candidate architectures, training each one using the 
expectation-maximization (EM) algorithm, and then 
compares their performance on fresh test data, select- 
ing the one with the smallest out-of-sample error. 

To compare the performance of CSSR against this 
baseline, we started with fully-connected HMMs with 



N 


dcv 


d-CSSR 


scv 


SCSSR 


10* 

10 3 
10 4 


1.27 ±0.23 
1.25 ±0.41 
1.15 ±0.02 


1.10 ±0.23 
0.19 ±0.23 
0.02 ± 0.02 


6.6 ±1.5 
5.6 ±1.7 

2.0 ±0 


1.6 ±1.0 
2.2 ±0.1 
2.0 ±0 



N 


dcv 


dcssR 


Scv 


SCSSR 


10 2 


1.41 ±0.23 


0.70 ±0.12 


4.5 ±2.1 


5.1 ±1.5 


10 3 


1.40 ±0.17 


0.21 ±0.06 


5.8 ±2.7 


6.6 ±0.8 


10 4 


1.40 ±0.11 


0.06 ±0.01 


2.3 ±0.7 


7.2 ±0.6 



Table 1: Comparison of the performance of HMM 
cross-validation to CSSR on the even process. dcv, 
total- variation distance between cross-validated HMM and 
the even process; dcssR, distance between reconstructed 
causal state model and the even process; scv, number of 
states in cross-validated HMM; Scssr, number of states 
in reconstructed model. In all cases the numbers given are 
the means over multiple independent trials, plus or minus 
one standard deviation. Recall that < d < 2, and that 
the minimal number of states needed is 2. 



M states, M — 1 to 10. We trained these, using 
the EM algorithm, on N data-points from the even 
process. Our test data consisted of another N data- 
points from an independent realization, and we se- 
lected HMMs based on the log-likelihood they assigned 
to the test data. Following common practice, the ini- 
tial HMM parameters fed to the EM algorithm were 
those for fully-connected models, i.e., every state could 
transition to every other state, and every state could 
emit every symbol. We then calculated, for each cross- 
validated HMM, the total variation distance between 
the distributions it and the even process generated over 
sequences of length 10. (Because cross-validation is 
so computationally intensive, we have only compared 
up to length N = 10 4 .) Table Q] compares this error 
measure for the cross- validated HMMs and for the re- 
constructed causal state models. It also indicates the 
number of states selected by cross-validation, which is 
consistently higher than the number needed by CSSR. 
Table gives the results of a completely parallel pro- 
cedure applied to the seven-state process. 

CSSR, like the VLMM methods, is a constructive ap- 
proach. Cross-validation is not constructive but se- 
lective. In our case, starting with fully-connected 
models (which, again, is a standard heuristic), cross- 
validated expectation-maximization never selected a 
model whose structure corresponded to the minimal 
sufficient statistic of the data-generating process. In 
both cases, the generalization of HMMs with more 
states worsened as the data-length grew, so cross- 
validation increasingly favored small HMMs which, 
while bad predictors, at least did not over-fit. Had 
models with the correct structure been in the initial 
population of candidates, they doubtless would have 
done quite well, and the gap in predictive performance 
between CSSR and cross-validation would be much 
smaller. Even when we have such prior architectural 
knowledge, CSSR will typically be faster than cross- 
validated EM, which involves performing nonlinear op- 
timization on multiple model structures. 



Table 2: Comparison of the performance of HMM cross- 
validation to CSSR on the seven-state process. Variables 
are as in0 

6 Conclusion 

We have described an algorithm, CSSR, for the unsu- 
pervised construction of optimal nonlinear predictors 
of discrete sequences. The predictors take the form of 
minimal sufficient statistics, arranged naturally into 
a hidden Markov model. CSSR's time complexity is 
linear in the data size. It reliably infers the statistical 
structure of processes with finitely many causal states. 
CSSR's predictive performance is at least comparable 
to cross-validated expectation-maximization, but it is 
constructive and faster; and the class of processes it 
can represent is strictly larger than those of competing 
constructive methods, such as variable-length Markov 
models. 

Two directions for future work suggest themselves. (1) 
CSSR does not require prior knowledge about system 
dynamics, but by the same token cannot exploit such 
knowledge when it exists. One way around this would 
be to initialize the algorithm with a non-trivial parti- 
tion of histories, reflecting a guess about which pat- 
terns are dynamically important, and let CSSR revise 
that partition the way it does now. It would be in- 
teresting to know when CSSR could correct an erro- 
neous initial partition. (2) HMMs are models of dy- 
namical systems without inputs. Partially-observable 
Markov decision processes (POMDPs) model systems 
with inputs, i.e., controlled dynamical systems. The 
causal state theory we have used generalizes naturally 
to this setting ^ ch. 7], where it is especially closely 
connected to PSRs. Suffix-tree methods can induce 
POMDPs from data [ZS E3 EHj , and we believe CSSR 
can be adapted to reconstruct POMDPs. This would 
provide a discovery procedure for PSRs. 
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